Classification of Music Genres Based on MFCC Features¶

MIE1628 Assignment 5 - Part B¶

Prepared by Farhan Wadia¶

Introduction¶

FMA: A Dataset For Music Analysis is a dataset consisting of all audio and metadata for music available in the Free Music Archive. The dataset has over 100,000 different music tracks, and metadata such as album, artist, genre, country of origin, release date, and various pre-calculated features that are commonly used in audio engineering analyses.

The purpose of this project is to determine if mel-frequency spectral coefficients (MFCCs) can be used to determine the genre of a song. For any audio file, MFCCs can be calculated by taking a Short Time Fourier Transform (STFT) of the time series (going into the frequency domain), taking the logarithm of the Fourier transform's magnitude for each window where the Fourier Transform was applied by STFT (which would make the data now be in the quefrency domain), and then applying the Mel scale to that. The Mel scale is essentially a conversion to account for nonlinearities in how humans perceive pitch; at higher frequencies, pitches are perceived to be further away. For example, a 1000 Hz sound will be perceived as being further from a 900 Hz sound than a 400 Hz sound is from a 300 Hz sound, despite the difference being 100 Hz in both cases. Accordingly, MFCCs can be considered as representing distinct units of sound corrsponding to the shape of a person's vocal tract as they speak or sing. This makes MFCCs a commonly used feature for problems and applications related to speech recognition and speaker recognition (e.g. to predict the accent or gender of a speaker) [1].

For convenience, the FMA dataset already provides the calculated MFCCs for all of the tracks within it. Although the dataset is split into different subset sizes to make analysis easier to deal with when dealing with audio files, only the data in fma_metadata.zip will be used since that contains the MFCC features already, and the entire dataset's metadata is significantly smaller than the audio files for even the small subset (342 MB vs. 7.2 GB). The particular genres that will be considered for classification will be discussed in the Data Exploration & Pre-Processing section of this notebook.

Data Exploration & Pre-Processing¶

This section covers all data exploration and pre-processing completed. The structure of this section alternates between data cleanup and data exploration / visualizations to ensure that all data has been prepared properly before moving to model implementations.

Data Loading¶

Begin by loading the echonest.csv, features.csv, genres.csv, and tracks.csv files and printing a few rows to visualize their structure before cleaning them.

In [55]:
import pandas as pd

#Based on https://stackoverflow.com/a/68620427

echonest = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/echonest.csv?sp=r&st=2022-08-05T01:21:50Z&se=2022-08-08T09:21:50Z&spr=https&sv=2021-06-08&sr=b&sig=BFnIwi75Mi9lI8cJ43OVGr8AN3zqiTB8oimIKSZMHvw%3D')
echonest.head(10)
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3134: DtypeWarning:

Columns (0,1,2,3,4,5,6,7,8,11,13,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249) have mixed types.Specify dtype option on import or set low_memory=False.

Out[55]:
Unnamed: 0 echonest echonest.1 echonest.2 echonest.3 echonest.4 echonest.5 echonest.6 echonest.7 echonest.8 ... echonest.239 echonest.240 echonest.241 echonest.242 echonest.243 echonest.244 echonest.245 echonest.246 echonest.247 echonest.248
0 NaN audio_features audio_features audio_features audio_features audio_features audio_features audio_features audio_features metadata ... temporal_features temporal_features temporal_features temporal_features temporal_features temporal_features temporal_features temporal_features temporal_features temporal_features
1 NaN acousticness danceability energy instrumentalness liveness speechiness tempo valence album_date ... 214 215 216 217 218 219 220 221 222 223
2 track_id NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 0.4166752327 0.6758939853 0.6344762684 0.0106280683 0.1776465712 0.1593100648 165.9220000000 0.5766609880 NaN ... -1.9923025370 6.8056936264 0.2330697626 0.1928800046 0.0274549890 0.0640799999 3.6769599915 3.6128799915 13.3166904449 262.9297485352
4 3 0.3744077685 0.5286430621 0.8174611317 0.0018511032 0.1058799438 0.4618181276 126.9570000000 0.2692402421 NaN ... -1.5823311806 8.8893079758 0.2584637702 0.2209050059 0.0813684240 0.0641300008 6.0827698708 6.0186400414 16.6735477448 325.5810852051
5 5 0.0435668989 0.7455658702 0.7014699916 0.0006967990 0.3731433124 0.1245953419 100.2600000000 0.6216612236 NaN ... -2.2883579731 11.5271091461 0.2568213642 0.2378199995 0.0601223968 0.0601399988 5.9264898300 5.8663496971 16.0138492584 356.7557373047
6 10 0.9516699648 0.6581786543 0.9245251615 0.9654270154 0.1154738842 0.0329852191 111.5620000000 0.9635898919 2008-03-11 ... -3.6629877090 21.5082283020 0.2833518982 0.2670699954 0.1257044971 0.0808200017 8.4140100479 8.3331899643 21.3170642853 483.4038085938
7 134 0.4522173071 0.5132380502 0.5604099311 0.0194426943 0.0965666940 0.5255193792 114.2900000000 0.8940722715 NaN ... -1.4526963234 2.3563981056 0.2346863896 0.1995500028 0.1493317783 0.0644000024 11.2670698166 11.2026700974 26.4541797638 751.1477050781
8 139 0.1065495253 0.2609111726 0.6070668636 0.8350869898 0.2236762711 0.0305692764 196.9610000000 0.1602670903 NaN ... -3.0786671638 12.4115667343 0.2708015740 0.2727000117 0.0252420790 0.0640399978 2.4366900921 2.3726501465 3.8970954418 37.8660430908
9 140 0.3763124975 0.7340790229 0.2656847734 0.6695811237 0.0859951222 0.0390682262 107.9520000000 0.6099912728 NaN ... -0.9346956015 -0.2609805167 0.3222317100 0.2779799998 0.1367472708 0.0753299966 9.8627195358 9.7873897552 21.9816207886 562.2294311523

10 rows × 250 columns

In [56]:
features = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/features.csv?sp=r&st=2022-08-05T01:22:56Z&se=2022-08-08T09:22:56Z&spr=https&sv=2021-06-08&sr=b&sig=YGgywE7ZJ0X490qBwWdljCaRSj7VzXkik5rjJB1WqVo%3D')
features.head(10)
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3134: DtypeWarning:

Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518) have mixed types.Specify dtype option on import or set low_memory=False.

Out[56]:
feature chroma_cens chroma_cens.1 chroma_cens.2 chroma_cens.3 chroma_cens.4 chroma_cens.5 chroma_cens.6 chroma_cens.7 chroma_cens.8 ... tonnetz.39 tonnetz.40 tonnetz.41 zcr zcr.1 zcr.2 zcr.3 zcr.4 zcr.5 zcr.6
0 statistics kurtosis kurtosis kurtosis kurtosis kurtosis kurtosis kurtosis kurtosis kurtosis ... std std std kurtosis max mean median min skew std
1 number 01 02 03 04 05 06 07 08 09 ... 04 05 06 01 01 01 01 01 01 01
2 track_id NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 7.1806526184e+00 5.2303090096e+00 2.4932080507e-01 1.3476201296e+00 1.4824777842e+00 5.3137123585e-01 1.4815930128e+00 2.6914546490e+00 8.6686819792e-01 ... 5.4125156254e-02 1.2225749902e-02 1.2110591866e-02 5.7588901520e+00 4.5947265625e-01 8.5629448295e-02 7.1289062500e-02 0.0000000000e+00 2.0898721218e+00 6.1448108405e-02
4 3 1.8889633417e+00 7.6053929329e-01 3.4529656172e-01 2.2952005863e+00 1.6540306807e+00 6.7592434585e-02 1.3668476343e+00 1.0540937185e+00 1.0810308903e-01 ... 6.3831120729e-02 1.4211839065e-02 1.7740072682e-02 2.8246941566e+00 4.6630859375e-01 8.4578499198e-02 6.3964843750e-02 0.0000000000e+00 1.7167237997e+00 6.9330163300e-02
5 5 5.2756297588e-01 -7.7654317021e-02 -2.7961030602e-01 6.8588310480e-01 1.9375696182e+00 8.8083887100e-01 -9.2319184542e-01 -9.2723226547e-01 6.6661673784e-01 ... 4.0730185807e-02 1.2690781616e-02 1.4759079553e-02 6.8084154129e+00 3.7500000000e-01 5.3114086390e-02 4.1503906250e-02 0.0000000000e+00 2.1933031082e+00 4.4860601425e-02
6 10 3.7022454739e+00 -2.9119303823e-01 2.1967420578e+00 -2.3444947600e-01 1.3673638105e+00 9.9841135740e-01 1.7706941366e+00 1.6045658588e+00 5.2121698856e-01 ... 7.4357867241e-02 1.7951935530e-02 1.3921394013e-02 2.1434211731e+01 4.5214843750e-01 7.7514506876e-02 7.1777343750e-02 0.0000000000e+00 3.5423245430e+00 4.0800448507e-02
7 20 -1.9383698702e-01 -1.9852678478e-01 2.0154602826e-01 2.5855624676e-01 7.7520370483e-01 8.4794059396e-02 -2.8929358721e-01 -8.1641042233e-01 4.3850939721e-02 ... 9.5002755523e-02 2.2492416203e-02 2.1355332807e-02 1.6669036865e+01 4.6972656250e-01 4.7224905342e-02 4.0039062500e-02 9.7656250000e-04 3.1898307800e+00 3.0992921442e-02
8 26 -6.9953453541e-01 -6.8415790796e-01 4.8824872822e-02 4.2658798397e-02 -8.1896692514e-01 -9.1712284088e-01 -9.0183424950e-01 -6.6844828427e-02 -2.9103723168e-01 ... 1.0371652246e-01 2.5541320443e-02 2.3846302181e-02 4.1645809174e+01 2.5048828125e-01 1.8387714401e-02 1.5625000000e-02 0.0000000000e+00 4.6905956268e+00 1.4598459937e-02
9 30 -7.2148716450e-01 -8.4855991602e-01 8.9090377092e-01 8.8619679213e-02 -4.4551330805e-01 -1.2711701393e+00 -1.2401897907e+00 -1.3437650204e+00 -9.0560036898e-01 ... 1.4169253409e-01 2.0426128060e-02 2.5417611003e-02 8.1665945053e+00 5.4687500000e-01 5.4416511208e-02 3.6132812500e-02 2.4414062500e-03 2.2447082996e+00 5.2673552185e-02

10 rows × 519 columns

In [57]:
genres = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/genres.csv?sp=r&st=2022-08-05T01:23:51Z&se=2022-08-08T09:23:51Z&spr=https&sv=2021-06-08&sr=b&sig=2ks%2FlQh1gLMdw79lsgGmpyRzbWbzkXiX21%2FIGSJ3cos%3D')
genres.head(10)
Out[57]:
genre_id #tracks parent title top_level
0 1 8693 38 Avant-Garde 38
1 2 5271 0 International 2
2 3 1752 0 Blues 3
3 4 4126 0 Jazz 4
4 5 4106 0 Classical 5
5 6 914 38 Novelty 38
6 7 217 20 Comedy 20
7 8 868 0 Old-Time / Historic 8
8 9 1987 0 Country 9
9 10 13845 0 Pop 10
In [58]:
tracks = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/tracks.csv?sp=r&st=2022-08-05T01:28:19Z&se=2022-08-08T09:28:19Z&spr=https&sv=2021-06-08&sr=b&sig=3WPRQ7zcry5EOW6w5RdQZVri0xShN1bGos2o4m3U0jk%3D')
tracks.head(10)
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3134: DtypeWarning:

Columns (0,1,5,6,8,12,18,20,21,22,24,33,34,38,39,44,47,49) have mixed types.Specify dtype option on import or set low_memory=False.

Out[58]:
Unnamed: 0 album album.1 album.2 album.3 album.4 album.5 album.6 album.7 album.8 ... track.10 track.11 track.12 track.13 track.14 track.15 track.16 track.17 track.18 track.19
0 NaN comments date_created date_released engineer favorites id information listens producer ... information interest language_code license listens lyricist number publisher tags title
1 track_id NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 0 2008-11-26 01:44:45 2009-01-05 00:00:00 NaN 4 1 <p></p> 6073 NaN ... NaN 4656 en Attribution-NonCommercial-ShareAlike 3.0 Inter... 1293 NaN 3 NaN [] Food
3 3 0 2008-11-26 01:44:45 2009-01-05 00:00:00 NaN 4 1 <p></p> 6073 NaN ... NaN 1470 en Attribution-NonCommercial-ShareAlike 3.0 Inter... 514 NaN 4 NaN [] Electric Ave
4 5 0 2008-11-26 01:44:45 2009-01-05 00:00:00 NaN 4 1 <p></p> 6073 NaN ... NaN 1933 en Attribution-NonCommercial-ShareAlike 3.0 Inter... 1151 NaN 6 NaN [] This World
5 10 0 2008-11-26 01:45:08 2008-02-06 00:00:00 NaN 4 6 NaN 47632 NaN ... NaN 54881 en Attribution-NonCommercial-NoDerivatives (aka M... 50135 NaN 1 NaN [] Freeway
6 20 0 2008-11-26 01:45:05 2009-01-06 00:00:00 NaN 2 4 <p> "spiritual songs" from Nicky Cook</p> 2710 NaN ... NaN 978 en Attribution-NonCommercial-NoDerivatives (aka M... 361 NaN 3 NaN [] Spiritual Level
7 26 0 2008-11-26 01:45:05 2009-01-06 00:00:00 NaN 2 4 <p> "spiritual songs" from Nicky Cook</p> 2710 NaN ... NaN 1060 en Attribution-NonCommercial-NoDerivatives (aka M... 193 NaN 4 NaN [] Where is your Love?
8 30 0 2008-11-26 01:45:05 2009-01-06 00:00:00 NaN 2 4 <p> "spiritual songs" from Nicky Cook</p> 2710 NaN ... NaN 718 en Attribution-NonCommercial-NoDerivatives (aka M... 612 NaN 5 NaN [] Too Happy
9 46 0 2008-11-26 01:45:05 2009-01-06 00:00:00 NaN 2 4 <p> "spiritual songs" from Nicky Cook</p> 2710 NaN ... NaN 252 en Attribution-NonCommercial-NoDerivatives (aka M... 171 NaN 8 NaN [] Yosemite

10 rows × 53 columns

Data Cleanup¶

Based on the results above, the dataframe headers need to be cleaned, and attributes that will not likely be needed can be removed.

Echonest¶

Only the acousticness, danceability, energy, instrumentalness, liveness, speechiness, tempo, and valence features are needed from echonest. Descriptions of these features can be found in the Spotify API documentation, and have been calculated by Echo Nest (which is now a part of Spotify) using their own machine learning methods. These features will not be used for classification, but rather to understand during the data exploration stage how these features are distributed among the genres.

In [59]:
#Echonest cleaning

#Make 2nd row as header
echonest.columns = echonest.iloc[1] 
echonest = echonest[2:]
echonest.rename(columns={echonest.columns[0]: "track_id"}, inplace = True)

#Remove blank row
echonest = echonest[1:]

#Keep desired features only
echonest = echonest[['track_id', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'tempo', 'valence']]

#Ensure data is numeric to help later on
echonest = echonest.apply(pd.to_numeric)
In [60]:
# Verify that dataframe looks right
echonest.head()
Out[60]:
1 track_id acousticness danceability energy instrumentalness liveness speechiness tempo valence
3 2 0.416675 0.675894 0.634476 0.010628 0.177647 0.159310 165.922 0.576661
4 3 0.374408 0.528643 0.817461 0.001851 0.105880 0.461818 126.957 0.269240
5 5 0.043567 0.745566 0.701470 0.000697 0.373143 0.124595 100.260 0.621661
6 10 0.951670 0.658179 0.924525 0.965427 0.115474 0.032985 111.562 0.963590
7 134 0.452217 0.513238 0.560410 0.019443 0.096567 0.525519 114.290 0.894072

Features¶

Only keep the track ID and mfcc features since the other features are outside the scope of this project. The track IDs are necessary to be able to lookup genres. Note that FMA provides the mean, median, minimum, maximum, standard deviation, skew, and kurtosis for the first 20 MFCCs for each track. In this case, only the mean of each track's MFCC coefficient will be used.

In [61]:
features.rename(columns={features.columns[0]: "track_id"}, inplace = True)

#Keep track_id and MFCC statistics
cols = ["track_id"]
cols.extend([f for f in features.columns.values if "mfcc" in f])
mfccs = features[cols]

#Only keep means and track_id
cols = mfccs.loc[0] == "mean"
cols["track_id"] = True
mfccs = mfccs.loc[:, cols]

new_headers = ["track_id"]
new_headers.extend(["mfcc." + str(x) for x in range(1, len(mfccs.columns))])
mfccs.columns = new_headers

mfccs = mfccs[3:]
mfccs = mfccs.dropna()

mfccs.head()
Out[61]:
track_id mfcc.1 mfcc.2 mfcc.3 mfcc.4 mfcc.5 mfcc.6 mfcc.7 mfcc.8 mfcc.9 ... mfcc.11 mfcc.12 mfcc.13 mfcc.14 mfcc.15 mfcc.16 mfcc.17 mfcc.18 mfcc.19 mfcc.20
3 2 -1.6377296448e+02 1.1669667816e+02 -4.1753826141e+01 2.9144329071e+01 -1.5050157547e+01 1.8879371643e+01 -8.9181652069e+00 1.2002118111e+01 -4.2531509399e+00 ... -2.6829998493e+00 -7.9463183880e-01 -6.9209713936e+00 -3.6553659439e+00 1.4652130604e+00 2.0107804239e-01 3.9982039928e+00 -2.1146764755e+00 1.1684176326e-01 -5.7858843803e+00
4 3 -1.5900416565e+02 1.2015850067e+02 -3.3233562469e+01 4.7342002869e+01 -6.2473182678e+00 3.1405355453e+01 -5.2618112564e+00 1.1618971825e+01 -1.5958366394e+00 ... -3.4226787090e+00 6.9492840767e+00 -4.1752557755e+00 -3.5288145542e+00 2.7471557260e-01 -2.2706823349e+00 1.0904747248e+00 -2.3438842297e+00 4.7182095051e-01 -1.5467071533e+00
5 5 -2.0544049072e+02 1.3221507263e+02 -1.6085823059e+01 4.1514759064e+01 -7.6429538727e+00 1.6942802429e+01 -5.6512613297e+00 9.5694446564e+00 5.0315696001e-01 ... -8.2713766098e+00 5.9447294474e-01 -3.4020280838e-01 2.3778877258e+00 7.8994874954e+00 1.9476414919e+00 7.4419503212e+00 -1.7399110794e+00 2.7801498771e-01 -5.4890155792e+00
6 10 -1.3586482239e+02 1.5704008484e+02 -5.3453247070e+01 1.7198896408e+01 6.8680348396e+00 1.3934344292e+01 -1.1749298096e+01 8.3607110977e+00 -5.1303811073e+00 ... -5.4212064743e+00 1.6794785261e+00 -6.2182493210e+00 1.8441945314e+00 -4.0997042656e+00 7.7994996309e-01 -5.5957680941e-01 -1.0183241367e+00 -3.8075449467e+00 -6.7953306437e-01
7 20 -1.3513589478e+02 1.1481417847e+02 1.2354539871e+01 1.9764219284e+01 1.8670799255e+01 1.9643861771e+01 3.5725092888e+00 1.2124897003e+01 -2.2851834297e+00 ... -8.0546313524e-01 4.0829424858e+00 2.1424494684e-01 3.8759169579e+00 -2.3532356322e-01 3.9029249549e-01 -5.7247143984e-01 2.7791724205e+00 2.4312584400e+00 3.0311167240e+00

5 rows × 21 columns

Genres¶

The genres dataframe is already cleaned, but for reference and easier visualization later, it would be good to extract the title names of all the parent genres and top level genres. This will then be used to make a decision on the genres that will be used for classification.

In [62]:
# Get unique parent IDs
parent_ids = set(genres[['parent']].values.reshape(-1))

#Get unique top_level IDs
top_level_ids = set(genres[['top_level']].values.reshape(-1))

# Form dataframes
parent_genres = genres.loc[genres['genre_id'].isin(parent_ids)]
top_level_genres = genres.loc[genres['genre_id'].isin(top_level_ids)]
In [63]:
print("The dataset contains the following", len(genres), "genres: \n")
print(genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 163 genres: 

['Avant-Garde', 'International', 'Blues', 'Jazz', 'Classical', 'Novelty', 'Comedy', 'Old-Time / Historic', 'Country', 'Pop', 'Disco', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Sound Effects', 'Folk', 'Soundtrack', 'Funk', 'Spoken', 'Hip-Hop', 'Audio Collage', 'Punk', 'Post-Rock', 'Lo-Fi', 'Field Recordings', 'Metal', 'Noise', 'Psych-Folk', 'Krautrock', 'Jazz: Vocal', 'Experimental', 'Electroacoustic', 'Ambient Electronic', 'Radio Art', 'Loud-Rock', 'Latin America', 'Drone', 'Free-Folk', 'Noise-Rock', 'Psych-Rock', 'Bluegrass', 'Electro-Punk', 'Radio', 'Indie-Rock', 'Industrial', 'No Wave', 'Free-Jazz', 'Experimental Pop', 'French', 'Reggae - Dub', 'Afrobeat', 'Nerdcore', 'Garage', 'Indian', 'New Wave', 'Post-Punk', 'Sludge', 'African', 'Freak-Folk', 'Jazz: Out', 'Progressive', 'Alternative Hip-Hop', 'Death-Metal', 'Middle East', 'Singer-Songwriter', 'Ambient', 'Hardcore', 'Power-Pop', 'Space-Rock', 'Polka', 'Balkan', 'Unclassifiable', 'Europe', 'Americana', 'Spoken Weird', 'Interview', 'Black-Metal', 'Rockabilly', 'Easy Listening: Vocal', 'Brazilian', 'Asia-Far East', 'N. Indian Traditional', 'South Indian Traditional', 'Bollywood', 'Pacific', 'Celtic', 'Be-Bop', 'Big Band/Swing', 'British Folk', 'Techno', 'House', 'Glitch', 'Minimal Electronic', 'Breakcore - Hard', 'Sound Poetry', '20th Century Classical', 'Poetry', 'Talk Radio', 'North African', 'Sound Collage', 'Flamenco', 'IDM', 'Chiptune', 'Musique Concrete', 'Improv', 'New Age', 'Trip-Hop', 'Dance', 'Chip Music', 'Lounge', 'Goth', 'Composed Music', 'Drum & Bass', 'Shoegaze', 'Kid-Friendly', 'Thrash', 'Synth Pop', 'Banter', 'Deep Funk', 'Spoken Word', 'Chill-out', 'Bigbeat', 'Surf', 'Radio Theater', 'Grindcore', 'Rock Opera', 'Opera', 'Chamber Music', 'Choral Music', 'Symphony', 'Minimalism', 'Musical Theater', 'Dubstep', 'Skweee', 'Western Swing', 'Downtempo', 'Cumbia', 'Latin', 'Sound Art', 'Romany (Gypsy)', 'Compilation', 'Rap', 'Breakbeat', 'Gospel', 'Abstract Hip-Hop', 'Reggae - Dancehall', 'Spanish', 'Country & Western', 'Contemporary Classical', 'Wonky', 'Jungle', 'Klezmer', 'Holiday', 'Salsa', 'Nu-Jazz', 'Hip-Hop Beats', 'Modern Jazz', 'Turkish', 'Tango', 'Fado', 'Christmas', 'Instrumental']
In [64]:
print("The dataset contains the following", len(parent_genres), "parent genres: \n")
print(parent_genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 39 parent genres: 

['International', 'Blues', 'Jazz', 'Classical', 'Novelty', 'Country', 'Pop', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Sound Effects', 'Folk', 'Soundtrack', 'Funk', 'Spoken', 'Hip-Hop', 'Punk', 'Post-Rock', 'Metal', 'Experimental', 'Loud-Rock', 'Latin America', 'Noise-Rock', 'Radio', 'Reggae - Dub', 'Garage', 'Indian', 'African', 'Middle East', 'Hardcore', 'Europe', 'Techno', 'House', 'Chip Music', 'Dubstep', 'Country & Western', 'Holiday', 'Instrumental']
In [65]:
print("The dataset contains the following", len(top_level_genres), "top-level genres: \n")
print(top_level_genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 16 top-level genres: 

['International', 'Blues', 'Jazz', 'Classical', 'Old-Time / Historic', 'Country', 'Pop', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Folk', 'Spoken', 'Hip-Hop', 'Experimental', 'Instrumental']

Looking at the top level genres, all seem uniquely distinct and to be an appropriate number of broad genres to be able to differentiate songs from one another. The only exception to this is the "International" top-level genre, which consists of 7 parent genres which should be considered distinct in their own right. For example, although European and Indian music are both international, both would be expected to sound completely different! To make the model more robust to recognize such a difference, and so that these unique genres don't get lumped together, the international top-level genre should be replaced by its parent genres. It should be noted that by doing this, there will still be an International genre, but genres like Latin America, Middle East, and Indian will be distinct from it.

In [66]:
international_genres_df = parent_genres.loc[parent_genres['top_level'] == 2]
international_genres_df["title"].values.tolist()
Out[66]:
['International',
 'Latin America',
 'Reggae - Dub',
 'Indian',
 'African',
 'Middle East',
 'Europe']
In [67]:
# Create dataframe of the genres that will be considered for classification
classification_genres = pd.concat([ top_level_genres[top_level_genres['title'] != "International"], 
                                    parent_genres.loc[parent_genres['top_level'] == 2]
                                    ], ignore_index=True)

classification_genres
Out[67]:
genre_id #tracks parent title top_level
0 3 1752 0 Blues 3
1 4 4126 0 Jazz 4
2 5 4106 0 Classical 5
3 8 868 0 Old-Time / Historic 8
4 9 1987 0 Country 9
5 10 13845 0 Pop 10
6 12 32923 0 Rock 12
7 13 730 0 Easy Listening 13
8 14 1499 0 Soul-RnB 14
9 15 34413 0 Electronic 15
10 17 12706 0 Folk 17
11 20 1876 0 Spoken 20
12 21 8389 0 Hip-Hop 21
13 38 38154 0 Experimental 38
14 1235 14938 0 Instrumental 1235
15 2 5271 0 International 2
16 46 573 2 Latin America 2
17 79 880 2 Reggae - Dub 2
18 86 216 2 Indian 2
19 92 329 2 African 2
20 102 176 2 Middle East 2
21 130 727 2 Europe 2
In [68]:
print("The", len(classification_genres), "genres listed above will be considered for the classification task.")
The 22 genres listed above will be considered for the classification task.

Tracks¶

Update the column headers for the tracks dataframe and then pre-process to assign genres properly.

In [69]:
pre = [f.split(".")[0] + "_" for f in tracks.columns] # keep the first word as a prefix for the feature (e.g. album, track, artist)
tracks_feature_names = [f for f in tracks.iloc[0]]
columns = [str(pre[i]) + str(tracks_feature_names[i]) for i in range(len(tracks_feature_names))]
columns[0] = "track_id"

tracks.columns = columns
tracks = tracks[2:]
In [70]:
# Print updated column headers for reference 
tracks.columns.values
Out[70]:
array(['track_id', 'album_comments', 'album_date_created',
       'album_date_released', 'album_engineer', 'album_favorites',
       'album_id', 'album_information', 'album_listens', 'album_producer',
       'album_tags', 'album_title', 'album_tracks', 'album_type',
       'artist_active_year_begin', 'artist_active_year_end',
       'artist_associated_labels', 'artist_bio', 'artist_comments',
       'artist_date_created', 'artist_favorites', 'artist_id',
       'artist_latitude', 'artist_location', 'artist_longitude',
       'artist_members', 'artist_name', 'artist_related_projects',
       'artist_tags', 'artist_website', 'artist_wikipedia_page',
       'set_split', 'set_subset', 'track_bit_rate', 'track_comments',
       'track_composer', 'track_date_created', 'track_date_recorded',
       'track_duration', 'track_favorites', 'track_genre_top',
       'track_genres', 'track_genres_all', 'track_information',
       'track_interest', 'track_language_code', 'track_license',
       'track_listens', 'track_lyricist', 'track_number',
       'track_publisher', 'track_tags', 'track_title'], dtype=object)

Based on the feature names above, only track_id, track_genre_top, track_genres, track_genres_all, and track_language_code would be expected to be useful. All other features can be removed.

In [71]:
tracks = tracks[['track_id', 'track_genre_top', 'track_genres', 'track_genres_all', 'track_language_code']]

# Verify that the dataframe looks ok
tracks.head(20)
Out[71]:
track_id track_genre_top track_genres track_genres_all track_language_code
2 2 Hip-Hop [21] [21] en
3 3 Hip-Hop [21] [21] en
4 5 Hip-Hop [21] [21] en
5 10 Pop [10] [10] en
6 20 NaN [76, 103] [17, 10, 76, 103] en
7 26 NaN [76, 103] [17, 10, 76, 103] en
8 30 NaN [76, 103] [17, 10, 76, 103] en
9 46 NaN [76, 103] [17, 10, 76, 103] en
10 48 NaN [76, 103] [17, 10, 76, 103] en
11 134 Hip-Hop [21] [21] en
12 135 Rock [45, 58] [58, 12, 45] en
13 136 Rock [45, 58] [58, 12, 45] en
14 137 Experimental [1, 32] [32, 1, 38] en
15 138 Experimental [1, 32] [32, 1, 38] en
16 139 Folk [17] [17] en
17 140 Folk [17] [17] en
18 141 Folk [17] [17] en
19 142 Folk [17] [17] en
20 144 Jazz [4] [4] en
21 145 Jazz [4] [4] en

Based on the above, a track can have multiple possible genres. However, for simplicity, notice that track_genre_top already contains the genres that were selected previously, with the exception of International needing to be replaced by one of the lower level parent genres, and NaN values. The NaN values in track_genre_top correspond to when there is no clear distinction for what type of genre the track is, so assume those rows can be removed to simplify the dataset. Additionally, track_genres and track_genres_all need to be converted from strings to lists to make them possible to work with.

In [72]:
import ast

# Convert strings to list
tracks["track_genres"] = tracks["track_genres"].apply(ast.literal_eval)
tracks["track_genres_all"] = tracks["track_genres_all"].apply(ast.literal_eval)
In [73]:
# Remove rows with track_genre_top = NaN
tracks = tracks[tracks['track_genre_top'].notna()]
In [74]:
# Look at some rows of the International genre before defining logic to classify into the narrower genres
intnl_tracks = tracks[tracks["track_genre_top"] == "International"]
intnl_tracks.head(50)
Out[74]:
track_id track_genre_top track_genres track_genres_all track_language_code
434 666 International [79] [2, 79] en
435 667 International [79] [2, 79] en
472 704 International [46] [2, 46] es
473 705 International [46] [2, 46] es
474 706 International [46] [2, 46] es
475 707 International [46] [2, 46] es
476 708 International [46] [2, 46] es
477 709 International [46] [2, 46] es
607 853 International [2] [2] en
821 1082 International [2] [2] en
1339 1680 International [2] [2] en
1340 1681 International [2] [2] en
1341 1682 International [2] [2] en
1342 1683 International [2] [2] en
1343 1684 International [2] [2] en
1344 1685 International [2] [2] en
1345 1686 International [2] [2] en
1346 1687 International [2] [2] en
1347 1688 International [2] [2] en
1348 1689 International [2] [2] en
1909 3586 International [2] [2] en
1910 3587 International [2] [2] en
1911 3588 International [2] [2] en
1912 3589 International [2] [2] en
1913 3590 International [2] [2] en
2076 3774 International [46, 117] [2, 117, 46] en
2077 3775 International [46, 117] [2, 117, 46] en
2078 3776 International [46, 117] [2, 117, 46] en
2079 3777 International [46, 117] [2, 117, 46] en
2080 3778 International [46, 117] [2, 117, 46] en
2081 3779 International [46, 117] [2, 117, 46] en
2191 3895 International [118] [2, 118] en
2192 3896 International [118] [2, 118] en
2193 3897 International [118] [2, 118] en
2194 3898 International [118] [2, 118] en
2195 3899 International [118] [2, 118] en
2337 4070 International [46] [2, 46] en
2338 4071 International [46] [2, 46] en
2339 4072 International [46] [2, 46] en
2340 4073 International [46] [2, 46] en
2341 4074 International [46] [2, 46] en
2342 4075 International [46] [2, 46] en
2343 4076 International [46] [2, 46] en
2344 4077 International [46] [2, 46] en
2345 4078 International [46] [2, 46] en
2346 4079 International [46] [2, 46] en
2347 4080 International [46] [2, 46] en
2348 4081 International [46] [2, 46] en
2358 4091 International [117, 118, 130] [2, 117, 118, 130] en
2359 4092 International [117, 118, 130] [2, 117, 118, 130] en

Based on the above, explode track_genres since track_genres_all duplicates the international ID of 2. For any rows where track_genres!=2, replace track_genre_top with the narrower genre title (if it exists). Finally, filter out unused genres which are not being used as classification genres (these would show as NaN errors in the mapping). This is acceptable because with exploding track_genres, one of the values should correspond to the desired parent, and the other values will correspond to the more specific genre titles not being used.

Finally, duplicate track IDs should be removed since they would correspond to tracks where the genre is still unclear (e.g. a track which is both African and Middle Eastern). The purpose of this is to aid with classification by mapping tracks to a single genre.

In [75]:
# Define dictionary of international genres with key as genre id and value as title
international_genres_dict = dict(zip(international_genres_df["genre_id"], international_genres_df["title"]))

print(international_genres_dict)
{2: 'International', 46: 'Latin America', 79: 'Reggae - Dub', 86: 'Indian', 92: 'African', 102: 'Middle East', 130: 'Europe'}
In [76]:
# Explode based on track_genres
intnl_tracks_exploded = intnl_tracks.explode('track_genres')
In [77]:
# Replace track_genre_top with narrower label if available
intnl_tracks_exploded["track_genre_top"] = intnl_tracks_exploded["track_genres"].map(international_genres_dict)

# Remove NaN values arising from mapping (i.e. genre will not be used)
intnl_tracks_exploded = intnl_tracks_exploded[intnl_tracks_exploded['track_genre_top'].notna()]
In [78]:
# Only keep unique track ids
intnl_tracks_exploded = intnl_tracks_exploded[intnl_tracks_exploded.duplicated(['track_id']) == False]

# Verify that all track ids are unique
if not intnl_tracks_exploded.duplicated(subset=['track_id']).any():
    print("No duplicate track ids are there in the international subset")
No duplicate track ids are there in the international subset
In [79]:
# Print intnl_tracks_exploded to verify that it looks ok
intnl_tracks_exploded.head(10)
Out[79]:
track_id track_genre_top track_genres track_genres_all track_language_code
434 666 Reggae - Dub 79 [2, 79] en
435 667 Reggae - Dub 79 [2, 79] en
472 704 Latin America 46 [2, 46] es
473 705 Latin America 46 [2, 46] es
474 706 Latin America 46 [2, 46] es
475 707 Latin America 46 [2, 46] es
476 708 Latin America 46 [2, 46] es
477 709 Latin America 46 [2, 46] es
607 853 International 2 [2] en
821 1082 International 2 [2] en

Now that the international tracks have been handled, they can be merged with the remaining non-international tracks. In doing the merge, only the track_id and track_genre_top are essential to keep. After that, we will verify that the values of track_genre_top align with the genres chosen for classification.

In [80]:
non_intnl_tracks = tracks[tracks["track_genre_top"] != "International"]

tracks_cleaned = pd.concat([non_intnl_tracks[["track_id", "track_genre_top"]], 
                            intnl_tracks_exploded[["track_id", "track_genre_top"]]], 
                            ignore_index=True)

tracks_cleaned = tracks_cleaned.rename(columns={'track_genre_top': 'genre'})
In [81]:
#Verify that tracks cleaned looks ok
tracks_cleaned.head(20)
Out[81]:
track_id genre
0 2 Hip-Hop
1 3 Hip-Hop
2 5 Hip-Hop
3 10 Pop
4 134 Hip-Hop
5 135 Rock
6 136 Rock
7 137 Experimental
8 138 Experimental
9 139 Folk
10 140 Folk
11 141 Folk
12 142 Folk
13 144 Jazz
14 145 Jazz
15 146 Jazz
16 147 Jazz
17 148 Experimental
18 149 Experimental
19 150 Experimental
In [82]:
#Verify that tracks_cleaned does not contain any genres outside of those chosen for classification

classification_genres_set = set(classification_genres['title'].values)

tracks_cleaned_genres_set = set(tracks_cleaned['genre'].values)

if classification_genres_set.intersection(tracks_cleaned_genres_set) == classification_genres_set:
    print("Genres in tracks_cleaned are valid")
Genres in tracks_cleaned are valid

Data Pre-Processing¶

The purpose of this section is to further cleanse the dataset to make visualization easier and to prepare the data into a form that is conducive for use in a classification or clustering model.

To begin, a consolidated dataset consisting of the tracks and genres selected for analysis, their MFCC features, and the echonest features should be created.

To assist with classification, the MFCC features will also be processed by PCA as shown in the example here to see if the MFCC features can be represented in a reduced form that lowers the data dimensionality. Prior to doing this, the features would need to be standardized to help improve results.

In [83]:
# Begin by filtering the mfccs dataframe to only the tracks that are part of tracks_cleaned.
# Drop the tracks with missing MFCC features
# To assist with visualization later, also bring in the echonest metrics for these tracks

dataset = pd.merge(left=tracks_cleaned, right=mfccs, on="track_id", how="left")
dataset = dataset.dropna() # drop tracks with missing MFCC features

dataset = pd.merge(left=dataset, right=echonest, on="track_id", how="left") # bring in echonest features

# Verify that the dataset looks ok
dataset.head(15)
Out[83]:
track_id genre mfcc.1 mfcc.2 mfcc.3 mfcc.4 mfcc.5 mfcc.6 mfcc.7 mfcc.8 ... mfcc.19 mfcc.20 acousticness danceability energy instrumentalness liveness speechiness tempo valence
0 2 Hip-Hop -1.6377296448e+02 1.1669667816e+02 -4.1753826141e+01 2.9144329071e+01 -1.5050157547e+01 1.8879371643e+01 -8.9181652069e+00 1.2002118111e+01 ... 1.1684176326e-01 -5.7858843803e+00 NaN NaN NaN NaN NaN NaN NaN NaN
1 3 Hip-Hop -1.5900416565e+02 1.2015850067e+02 -3.3233562469e+01 4.7342002869e+01 -6.2473182678e+00 3.1405355453e+01 -5.2618112564e+00 1.1618971825e+01 ... 4.7182095051e-01 -1.5467071533e+00 NaN NaN NaN NaN NaN NaN NaN NaN
2 5 Hip-Hop -2.0544049072e+02 1.3221507263e+02 -1.6085823059e+01 4.1514759064e+01 -7.6429538727e+00 1.6942802429e+01 -5.6512613297e+00 9.5694446564e+00 ... 2.7801498771e-01 -5.4890155792e+00 NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Pop -1.3586482239e+02 1.5704008484e+02 -5.3453247070e+01 1.7198896408e+01 6.8680348396e+00 1.3934344292e+01 -1.1749298096e+01 8.3607110977e+00 ... -3.8075449467e+00 -6.7953306437e-01 NaN NaN NaN NaN NaN NaN NaN NaN
4 134 Hip-Hop -2.0766148376e+02 1.2552130890e+02 -3.3416591644e+01 3.2260929108e+01 8.0747709274e+00 1.5349553108e+01 -4.0741791725e+00 1.0281721115e+01 ... -7.7040648460e-01 -3.9955995083e+00 NaN NaN NaN NaN NaN NaN NaN NaN
5 135 Rock -9.0879714966e+01 1.5976300049e+02 -4.2893623352e+01 3.5776615143e+01 -1.8252986908e+01 2.0433145523e+01 -7.9369482994e+00 1.2992751122e+01 ... 1.2024710178e+00 -2.4587099552e+00 NaN NaN NaN NaN NaN NaN NaN NaN
6 136 Rock -8.4803161621e+01 1.4372821045e+02 -6.7442865372e+00 2.5492109299e+01 5.0692691803e+00 1.6982337952e+01 -2.4718496799e+00 5.3463969231e+00 ... -1.7590852976e+00 4.0495863557e-01 NaN NaN NaN NaN NaN NaN NaN NaN
7 137 Experimental -1.2391856384e+02 1.5573527527e+02 -8.0915206909e+01 2.7569656372e+01 1.0932379723e+01 1.9220283508e+01 -1.8276212692e+01 -1.8288209915e+01 ... -1.0690842867e+00 1.7065395117e+00 NaN NaN NaN NaN NaN NaN NaN NaN
8 138 Experimental -7.9164611816e+01 8.5144149780e+01 -3.1939628601e+01 3.0368049622e+01 3.4229247570e+00 1.2789516449e+01 -1.5265024185e+01 -1.3450626373e+01 ... -1.6415536404e+00 1.6512305737e+00 NaN NaN NaN NaN NaN NaN NaN NaN
9 139 Folk -1.2750725555e+02 1.5288587952e+02 -5.8565074921e+01 4.9597194672e+01 -6.6043100357e+00 2.2506578445e+01 -7.1333312988e+00 9.7062120438e+00 ... -5.1429504156e-01 2.6310398579e+00 NaN NaN NaN NaN NaN NaN NaN NaN
10 140 Folk -2.2571331787e+02 1.3933282471e+02 -1.3097699165e+01 4.4533355713e+01 2.4683995247e+00 2.8328742981e+01 -9.9314813614e+00 1.0810856819e+01 ... -1.5860234201e-01 5.9409761429e-01 NaN NaN NaN NaN NaN NaN NaN NaN
11 141 Folk -2.5314390564e+02 1.5571632385e+02 -1.6636627197e+01 2.3683815002e+01 6.0459570885e+00 1.1692952156e+01 -9.9477605820e+00 6.8878135681e+00 ... 2.8093214035e+00 3.3257400990e+00 NaN NaN NaN NaN NaN NaN NaN NaN
12 142 Folk -1.5323365784e+02 1.3514985657e+02 -4.9444625854e+01 4.2056404114e+01 -1.9741883278e+00 1.4290251732e+01 -8.3063608408e-01 1.0496274948e+01 ... 6.0231194496e+00 5.0714325905e+00 NaN NaN NaN NaN NaN NaN NaN NaN
13 144 Jazz -1.2692814636e+02 1.2631162262e+02 -3.1843872070e+01 4.5561306000e+01 -3.8294014335e-01 1.0514379501e+01 -1.1236815453e+01 7.0307722092e+00 ... -1.0537501574e+00 1.1049208641e+00 NaN NaN NaN NaN NaN NaN NaN NaN
14 145 Jazz -1.3589109802e+02 1.2842012024e+02 -3.3427680969e+01 4.3987606049e+01 -4.1240496635e+00 1.7416790009e+01 -9.8095483780e+00 8.7755756378e+00 ... 1.0460131168e+00 -1.6675879955e+00 NaN NaN NaN NaN NaN NaN NaN NaN

15 rows × 30 columns

In [84]:
import numpy as np
import sklearn as skl
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardization and PCA code adapted from https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

# Standardize the mfcc features of the dataset
X = dataset.loc[:, [c for c in dataset.columns if "mfcc" in c]]
X_std = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA(n_components=20)
principalComponents = pca.fit_transform(X_std)

explainedVariance = pca.explained_variance_ratio_
cumulativeEV = np.cumsum(explainedVariance)

Data Visualization¶

Begin by looking at how the dataset is distibuted among the genres:

In [85]:
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook"

track_counts = dataset.groupby('genre', as_index=False).count().sort_values(by='track_id', ascending=False)

fig = px.bar(track_counts, x='genre', y='track_id', labels = {"genre": "Genre", "track_id":"Count"}, title='Count of Tracks by Genre')
fig.show()
In [86]:
print("There are", len(dataset), "tracks in the dataset")
There are 40129 tracks in the dataset

Based on the above chart, the top 5 genres are rock, experimental, electronic, hip-hop and folk. Additionally, it should be noted that although the International genre was split into multiple genres, classification might not work well for each of these sub genres solely because the number of tracks is limited (i.e. Europe, African, Middle East, and Indian) have the fewest tracks.

Create a scree plot to visualize the effectiveness of using different choices on the number of principal components in explaining the MFCC feature variance:

In [87]:
fig = px.line(x=np.arange(1, 21, 1), y=cumulativeEV, labels=dict(x="PC", y="Explained Variance"), markers=True,
              color=px.Constant("Cumulative Explained Variance"), title="PCA Scree Plot")
fig.add_bar(x=np.arange(1, 21, 1), y=explainedVariance, name="Explained Variance")

fig.show()

The scree plot shows that just over half of the dataset's variance can be explained by the first 3 PCs. The first 10 PCs explain a bit over 80% of the variance in the MFCCs.

For the next visualizations, plot the first 2 and first 3 PCs labelled by genre to see if any clusters are apparent. All plots in the remainder of this section can be filtered as needed by clicking on the series' in the legend. Double clicking on a series will isolate it from all others, whereas single clicking will toggle it on/off.

In [88]:
PCs_df = pd.DataFrame(data=principalComponents[:, :3], columns=["PC1", "PC2", "PC3"])

tracks_and_genres = dataset[["track_id", "genre"]]

PCs_df = pd.concat([PCs_df.reset_index(drop=True), tracks_and_genres.reset_index(drop=True)], axis=1)
In [89]:
fig = px.scatter(PCs_df, x="PC1", y="PC2", color="genre", symbol="genre", title="Top 2 MFCC Principal Components by Genre")
fig.show()
In [90]:
fig = px.scatter_3d(PCs_df, x="PC1", y="PC2", z="PC3", color='genre')

fig.update_layout(margin=dict(l=10, r=10, b=10, t=40), title='Top 3 MFCC Principal Components by Genre')
fig.update_scenes(xaxis_autorange="reversed", yaxis_autorange="reversed")
fig.show()

Looking at the principal components of MFCCs by genre plots, there are regions where certain genres are more likely to be, but there do not appear to be clear cluster boundaries or consistency in the shapes of the clusters for each genre. Some general observations on distinguishing between the genres are provided below:

  • Experimental seems to have the broadest variance across PC1 and is spreadout over the x-axis. Given that the genre is experimental, this result makes sense since experimental music would, by definition, be a collection of different, unique tracks that don't fit into established categories, and would be expected to vary from track to track since there is no pre-defined "style" to it.
  • Visually, the clusters that would likely be easiest to separate from each other are classical, electronic, and rock. It should also be noted that rock and electronic are the top and third-most popular genres in the dataset respectively.
  • Electronic and jazz are towards the right of the 2D plot; rock is towards the top; and international, Latin American, and folk are towards the left of the chart. This would suggest that PC1 is representing a more "guitary" song to the left, and more "scattered" (electronic, jazzy) sound to the right. Accordingly, this PC might be representing the timbre of a song from perceptions of different instrument types.

For the final visualizations, the echonest features will be looked at in order to understand how music genres vary based on qualities such as acousticness, danceability, energy, instrumentalness, liveness, speechiness, tempo, valence. Although these features will not be used for classification, looking at them at this stage should provide clearer insights on how music genres vary. Note that not all genres will be present in each chart due to missing data in the echonest features.

In [91]:
# Calculate averages of echonest features and drop genres with missing info
echonest_averages = dataset.groupby('genre', as_index=False).agg({"acousticness": "mean", 
                                                         "danceability":"mean",
                                                         "energy":"mean",
                                                         "instrumentalness":"mean",
                                                         "liveness":"mean",
                                                         "speechiness":"mean",
                                                         "tempo":"mean",
                                                         "valence":"mean"})
echonest_averages = echonest_averages.dropna()
echonest_averages
Out[91]:
genre acousticness danceability energy instrumentalness liveness speechiness tempo valence
1 Blues 0.883875 0.485561 0.390235 0.183444 0.122580 0.037252 119.513600 0.505450
2 Classical 0.986494 0.318438 0.054674 0.715784 0.218769 0.059107 99.240562 0.239131
5 Electronic 0.280795 0.590354 0.637690 0.749167 0.168776 0.097246 125.108284 0.431572
7 Experimental 0.599967 0.573539 0.430228 0.506951 0.164140 0.091255 123.640412 0.617081
8 Folk 0.747726 0.465819 0.343948 0.543683 0.154615 0.057183 118.175960 0.346635
9 Hip-Hop 0.363376 0.646300 0.586759 0.337519 0.188337 0.254353 117.676421 0.595177
11 Instrumental 0.577847 0.498469 0.502319 0.531401 0.202884 0.098989 114.778583 0.427666
12 International 0.788740 0.530034 0.424812 0.567316 0.188242 0.170889 125.104935 0.641766
13 Jazz 0.755227 0.383942 0.328465 0.702578 0.171593 0.082645 109.688016 0.287190
16 Old-Time / Historic 0.963293 0.503834 0.252559 0.637894 0.331007 0.142882 118.115613 0.560074
17 Pop 0.482880 0.574947 0.489717 0.375738 0.156410 0.061806 121.014888 0.433490
19 Rock 0.382597 0.393707 0.664871 0.600970 0.193606 0.064910 126.942851 0.413129
In [92]:
# Plot the averages for each genre
# Note that tempo should be considered separately in isolation because the values are not scaled like the others
fig = px.line(  echonest_averages, x="genre", 
                y=['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'tempo', 'valence'],
                title="Averages of Echonest Features by Genre")

# Hide tempo by default so that it doesn't skew the axes
for trace in fig['data']:
    if trace['name'] in ["tempo"]:
        trace['visible'] = 'legendonly'

fig.show()
In [93]:
# Obtain the standard deviations of each series above. Higher standard deviation would indicate more variance between genres for a given feature.
echonest_averages.std().sort_values(ascending=False)
Out[93]:
tempo               7.716330
acousticness        0.239611
energy              0.171386
instrumentalness    0.168242
valence             0.129432
danceability        0.095593
speechiness         0.061185
liveness            0.051566
dtype: float64

Looking at the results above, tempo, acousticness, energy, and instrumentalness have the most variance. Some observations from the plots are noted below:

  • The tempo, speechiness, and acousticness plots show that rock and classical are quite distinct from each other and would likely be easiest to differentiate from each other. The PCA plots also appear to show this distinction with rock at the high end of PC2 and classical at the low end of PC2
  • In terms of characteristics for particular genres:
    • Blues have low instrumentalness and high acousticness
    • Classical, old-time/historic, and folk to a lesser degree have high instrumentalness, but low energy
    • Hip-hop is the most speechy
    • Rock and electronic have the most energy

Given the results above, it would also be interesting to see how tempo, acousticness, energy, and instrumentalness are distributed among the genres as compared to looking at their averages, especially considering that they have the highest variability. To do this, use a KDE estimate rather than histograms (so the diagrams don't get too cluttered) and plot the probability density (i.e. area under each curve should be 1).

In [94]:
import math 
import plotly.figure_factory as ff

# Helper function for plotting features distributions for genres with available data
def plotFeatureDistribution(dataset, feature_name):
    f_name = feature_name.title()

    #Get dictionary with genre as key and list of feature values (e.g. acousticness) as values
    by_genre_dict = dataset.groupby("genre")[feature_name].apply(list).to_dict()

    #Form the dictinary into a list of lists. Remove NaNs in each inner list
    by_genre = [[x for x in by_genre_dict[k] if math.isnan(x) == False] for k in by_genre_dict.keys()]

    # Remove genres where the list became empty and keep track of which genres are available 
    genres_available = []
    by_genre_available = []
    for i, genre in enumerate(by_genre):
        if by_genre[i]: #List is not empty
            by_genre_available.append(by_genre[i])
            genres_available.append(list(by_genre_dict.keys())[i])

    fig = ff.create_distplot(by_genre_available, genres_available, show_hist=False)
    fig.update_layout(title_text='Distribution of ' + f_name + ' by Genre', xaxis=dict(title=f_name), yaxis=dict(title="Probability Density"))

    fig.show()
In [95]:
plotFeatureDistribution(dataset, 'tempo')

For tempo, hip-hop and pop seem to have the lowest variance. However, hip-hop does seem to have a bimodal distribution with the main peak around 95 BPM, and a smaller peak around 180 BPM. This could be representative of hip-hop consisting of multiple styles where some are faster than others (i.e. rap music, which is also a subgenre of hip-hop in this dataset). Rock appears to have high variance in tempo, and in general, blues typically seem to have faster tempos (lower variance than rock and appearing towards the right of the figure).

In [96]:
plotFeatureDistribution(dataset, 'acousticness')

Classical clearly appears to have the lowest variance and is clustered around a score of 0.99, which is very similar to the average result in the Average of Echonest Features by Genre plot. Filtering that out by clicking on it in the legend and looking at the remaining curves, blues and old-time/historic have low variance, but are skewed to the right. Electronic and rock generally have low acousticness, while folk and jazz have high acousticness. Instrumental, experimental, and pop have fairly uniform distributions of acousticness.

In [97]:
plotFeatureDistribution(dataset, 'energy')

Classical has the lowest energy and variance. The variance for old-time/historical and blues are also genrally lower than the other genres, with old-time/historical peaking around 0.2 and blues peaking around 0.4. Rock and electronic have fairly uniform distributions, but concentrated slightly to the high side.

In [98]:
plotFeatureDistribution(dataset, 'instrumentalness')

Electronic, interestingly, is considered to have high instrumentalness, and also appears to have the lowest variance, followed by rock. Distributions for hip-hop and pop are skewed to the left on the other hand, and experimental appears to have the most uniform distribution of instrumentalness. Experimental also had a fairly uniform distribution for acousticness, which makes sense given the nature of this genre as discussed earlier in the comments for the PCA plots.

Model Development¶

For genre classification based on the MFCC features, the 2 models that will be developed are a k-nearest neighbours (KNN) model and a random forest model. The KNN model has been chosen due to its simplicity for implementation, and because looking at the PCA visualizations, there does seem to be like there are particular regions where certain genres lie to help distinguish them from other genres. For example, considering hip-hop, folk, and old-time/historic in isolation on the PCA plots, it does seem like each of these genres occupy distinct regions. However, given the relatively large number of genres in the classification list, there are also lots of regions with overlap and no clear distinction. By using a KNN model, the hope is that choosing the class which has a majority in the k-nearest neighbours will lead to good results.

A random forest was chosen as the second model to provide results with less bias as compared to using a single decision tree, but again, since it would be relatively simple to implement. Although non-linear models such as a feedforward neural network would likely provide more accurate results, the goal with these models is to first see if simpler models suffice. As a future extension to this project, classification with a neural network could be interesting to implement.

Note that hyperparameter tuning will be done for both models within this same section.

Pre-Processing for Model Training¶

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

#Split the data into training, validation, and test sets 70-15-15
X_train_val, X_test, y_train_val, y_test = train_test_split(dataset.loc[:, [c for c in dataset.columns if "mfcc" in c]], dataset["genre"], test_size=0.15, random_state=21)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=15/85, random_state=21)

#Standardize the data
scaler = StandardScaler()
scaler.fit(X_train)

X_train_std = scaler.fit_transform(X_train)
X_val_std = scaler.fit_transform(X_val)
X_test_std = scaler.fit_transform(X_test)

#Apply PCA on the MFCC features
pca = PCA(n_components=20)
pca.fit(X_train_std)

X_train_pca = pca.transform(X_train_std)
X_val_pca = pca.transform(X_val_std)
X_test_pca = pca.transform(X_test_std)

KNN Model¶

Train a KNN model and vary k and the number of PCA components to use in the model. Keep the model that performs best on the validation set.

In [100]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

bestValAccuracy = 0
bestK = -1
bestNumFeatures = -1

for k in range(15, 25):
    for n_components in range(4, 11):
        #Fit to training data
        KNNmodel = KNeighborsClassifier(n_neighbors=k)
        KNNmodel.fit(X_train_pca[:, :n_components], y_train)

        y_pred_train = KNNmodel.predict(X_train_pca[:, :n_components])
        y_pred_val = KNNmodel.predict(X_val_pca[:, :n_components])

        train_acc = metrics.accuracy_score(y_train, y_pred_train)
        val_acc = metrics.accuracy_score(y_val, y_pred_val)

        if val_acc > bestValAccuracy:
            bestValAccuracy = val_acc
            bestK = k
            bestNumFeatures = n_components

        print("k:", k, "| Number of components:", n_components, "| Training Accuracy:", train_acc, "| Validation Accuracy:", val_acc)

print("\nThe best model on the validation set used k =", bestK, "and", bestNumFeatures, "PCA components. The validation set accuracy is", bestValAccuracy)
k: 15 | Number of components: 4 | Training Accuracy: 0.5075652390615544 | Validation Accuracy: 0.45348837209302323
k: 15 | Number of components: 5 | Training Accuracy: 0.5331268468083591 | Validation Accuracy: 0.4818936877076412
k: 15 | Number of components: 6 | Training Accuracy: 0.538360212182705 | Validation Accuracy: 0.48322259136212625
k: 15 | Number of components: 7 | Training Accuracy: 0.5460500551817438 | Validation Accuracy: 0.5006644518272425
k: 15 | Number of components: 8 | Training Accuracy: 0.5548435330556446 | Validation Accuracy: 0.5034883720930232
k: 15 | Number of components: 9 | Training Accuracy: 0.5600768984299904 | Validation Accuracy: 0.5129568106312292
k: 15 | Number of components: 10 | Training Accuracy: 0.5658442806792695 | Validation Accuracy: 0.5129568106312292
k: 16 | Number of components: 4 | Training Accuracy: 0.5053223681868347 | Validation Accuracy: 0.4569767441860465
k: 16 | Number of components: 5 | Training Accuracy: 0.5284987005589377 | Validation Accuracy: 0.48438538205980064
k: 16 | Number of components: 6 | Training Accuracy: 0.5352629143080921 | Validation Accuracy: 0.4840531561461794
k: 16 | Number of components: 7 | Training Accuracy: 0.5454448360568194 | Validation Accuracy: 0.5024916943521595
k: 16 | Number of components: 8 | Training Accuracy: 0.5515326284310584 | Validation Accuracy: 0.5029900332225914
k: 16 | Number of components: 9 | Training Accuracy: 0.5588664601801417 | Validation Accuracy: 0.5093023255813953
k: 16 | Number of components: 10 | Training Accuracy: 0.5646338424294207 | Validation Accuracy: 0.5144518272425249
k: 17 | Number of components: 4 | Training Accuracy: 0.5020826658122397 | Validation Accuracy: 0.45614617940199337
k: 17 | Number of components: 5 | Training Accuracy: 0.5279290825590088 | Validation Accuracy: 0.4845514950166113
k: 17 | Number of components: 6 | Training Accuracy: 0.5318452063085194 | Validation Accuracy: 0.48754152823920266
k: 17 | Number of components: 7 | Training Accuracy: 0.5438783865570151 | Validation Accuracy: 0.5029900332225914
k: 17 | Number of components: 8 | Training Accuracy: 0.5500373811812453 | Validation Accuracy: 0.5026578073089701
k: 17 | Number of components: 9 | Training Accuracy: 0.5576916230552885 | Validation Accuracy: 0.5089700996677741
k: 17 | Number of components: 10 | Training Accuracy: 0.5592224714300972 | Validation Accuracy: 0.507641196013289
k: 18 | Number of components: 4 | Training Accuracy: 0.5019402613122574 | Validation Accuracy: 0.4558139534883721
k: 18 | Number of components: 5 | Training Accuracy: 0.5261846274342269 | Validation Accuracy: 0.48322259136212625
k: 18 | Number of components: 6 | Training Accuracy: 0.530919577058635 | Validation Accuracy: 0.4885382059800664
k: 18 | Number of components: 7 | Training Accuracy: 0.5407810886824024 | Validation Accuracy: 0.5038205980066445
k: 18 | Number of components: 8 | Training Accuracy: 0.5483641283064545 | Validation Accuracy: 0.5051495016611296
k: 18 | Number of components: 9 | Training Accuracy: 0.5542027128057246 | Validation Accuracy: 0.5074750830564784
k: 18 | Number of components: 10 | Training Accuracy: 0.5569083983053864 | Validation Accuracy: 0.5083056478405316
k: 19 | Number of components: 4 | Training Accuracy: 0.4995193848125601 | Validation Accuracy: 0.45714285714285713
k: 19 | Number of components: 5 | Training Accuracy: 0.5238705543095162 | Validation Accuracy: 0.48255813953488375
k: 19 | Number of components: 6 | Training Accuracy: 0.5309551781836306 | Validation Accuracy: 0.48803986710963454
k: 19 | Number of components: 7 | Training Accuracy: 0.538431414432696 | Validation Accuracy: 0.5043189368770764
k: 19 | Number of components: 8 | Training Accuracy: 0.546584072056677 | Validation Accuracy: 0.5006644518272425
k: 19 | Number of components: 9 | Training Accuracy: 0.5516750329310406 | Validation Accuracy: 0.5088039867109635
k: 19 | Number of components: 10 | Training Accuracy: 0.556231976930471 | Validation Accuracy: 0.5109634551495017
k: 20 | Number of components: 4 | Training Accuracy: 0.4988785645626402 | Validation Accuracy: 0.45880398671096345
k: 20 | Number of components: 5 | Training Accuracy: 0.5228025205596497 | Validation Accuracy: 0.48438538205980064
k: 20 | Number of components: 6 | Training Accuracy: 0.531239987183595 | Validation Accuracy: 0.4883720930232558
k: 20 | Number of components: 7 | Training Accuracy: 0.5372565773078429 | Validation Accuracy: 0.5044850498338871
k: 20 | Number of components: 8 | Training Accuracy: 0.544768414681904 | Validation Accuracy: 0.5016611295681063
k: 20 | Number of components: 9 | Training Accuracy: 0.5507850048061519 | Validation Accuracy: 0.5101328903654485
k: 20 | Number of components: 10 | Training Accuracy: 0.5547011285556623 | Validation Accuracy: 0.5104651162790698
k: 21 | Number of components: 4 | Training Accuracy: 0.49862935668767133 | Validation Accuracy: 0.4589700996677741
k: 21 | Number of components: 5 | Training Accuracy: 0.5200612339349924 | Validation Accuracy: 0.48272425249169437
k: 21 | Number of components: 6 | Training Accuracy: 0.528712307308911 | Validation Accuracy: 0.4877076411960133
k: 21 | Number of components: 7 | Training Accuracy: 0.5366157570579231 | Validation Accuracy: 0.5021594684385382
k: 21 | Number of components: 8 | Training Accuracy: 0.5438783865570151 | Validation Accuracy: 0.49966777408637875
k: 21 | Number of components: 9 | Training Accuracy: 0.5493965609313254 | Validation Accuracy: 0.5073089700996678
k: 21 | Number of components: 10 | Training Accuracy: 0.550856207056143 | Validation Accuracy: 0.5098006644518273
k: 22 | Number of components: 4 | Training Accuracy: 0.4978461319377692 | Validation Accuracy: 0.46146179401993354
k: 22 | Number of components: 5 | Training Accuracy: 0.5192068069350991 | Validation Accuracy: 0.4850498338870432
k: 22 | Number of components: 6 | Training Accuracy: 0.5273594645590801 | Validation Accuracy: 0.48471760797342195
k: 22 | Number of components: 7 | Training Accuracy: 0.5344084873081989 | Validation Accuracy: 0.5014950166112957
k: 22 | Number of components: 8 | Training Accuracy: 0.5425255438071843 | Validation Accuracy: 0.5021594684385382
k: 22 | Number of components: 9 | Training Accuracy: 0.547687706931539 | Validation Accuracy: 0.5086378737541528
k: 22 | Number of components: 10 | Training Accuracy: 0.5496457688062942 | Validation Accuracy: 0.5101328903654485
k: 23 | Number of components: 4 | Training Accuracy: 0.4949624408131297 | Validation Accuracy: 0.46312292358803986
k: 23 | Number of components: 5 | Training Accuracy: 0.519064402435117 | Validation Accuracy: 0.48754152823920266
k: 23 | Number of components: 6 | Training Accuracy: 0.5257930150592759 | Validation Accuracy: 0.4885382059800664
k: 23 | Number of components: 7 | Training Accuracy: 0.5326996333084125 | Validation Accuracy: 0.5014950166112957
k: 23 | Number of components: 8 | Training Accuracy: 0.5421339314322332 | Validation Accuracy: 0.5041528239202658
k: 23 | Number of components: 9 | Training Accuracy: 0.5456584428067927 | Validation Accuracy: 0.5073089700996678
k: 23 | Number of components: 10 | Training Accuracy: 0.5505001958061875 | Validation Accuracy: 0.5117940199335548
k: 24 | Number of components: 4 | Training Accuracy: 0.4946064295631742 | Validation Accuracy: 0.4632890365448505
k: 24 | Number of components: 5 | Training Accuracy: 0.5176403574352949 | Validation Accuracy: 0.4878737541528239
k: 24 | Number of components: 6 | Training Accuracy: 0.5250097903093738 | Validation Accuracy: 0.48903654485049836
k: 24 | Number of components: 7 | Training Accuracy: 0.5310263804336217 | Validation Accuracy: 0.4978405315614618
k: 24 | Number of components: 8 | Training Accuracy: 0.5395706504325537 | Validation Accuracy: 0.5039867109634552
k: 24 | Number of components: 9 | Training Accuracy: 0.5444124034319484 | Validation Accuracy: 0.506312292358804
k: 24 | Number of components: 10 | Training Accuracy: 0.5492185553063477 | Validation Accuracy: 0.5119601328903655

The best model on the validation set used k = 16 and 10 PCA components. The validation set accuracy is 0.5144518272425249
In [101]:
# Plot a confusion matrix for the classification counts in the validation set

from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

# Recreate best model and obtain validation set predictions
bestKNNModel = KNeighborsClassifier(n_neighbors=bestK)
bestKNNModel.fit(X_train_pca[:, :bestNumFeatures], y_train)
y_pred_val = bestKNNModel.predict(X_val_pca[:, :bestNumFeatures])

lbls = classification_genres["title"].values.tolist()

z = confusion_matrix(y_val, y_pred_val, labels=lbls)

fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()

The KNN model with the highest accuracy on the validation set used k=16 and the first 10 PCA components of the MFCC features; validation set accuracy was 51.4% for this model, and training set accuracy was 56.5%. This indicates that the model has been fit reasonably and is not overfit. Although an accuracy of 51.4% may seem low, it is important to keep in mind that since there are 22 genres, a random classifier would be expected to have an accuracy, on average, of 1/22 = 4.5%. With this relatively simple KNN model, validation set accuracy is more than 10x that of a baseline, random classifier identifying genres by chance.

The following observations are made based on the confusion matrix:

  • Rock is correctly classified the most, although it should be noted from the data visualization section that the dataset is imbalanced and rock is the most common class.
  • Blues, jazz, country, pop, easy-listening, soul R&B, spoken, hip-hop, instrumental, and all of the international genres are classified incorrectly more often than they are classified correctly
  • Rock, electronic, and experimental are the most commonly predicted classes in order, but also the top 3 most common classes in the dataset.
  • Ultimately, the flaws with this model are likely related to the unbalanced training dataset on which it was trained

Random Forest Model¶

To try a model that is slightly more complex than the KNN, but not as complex as a feedforward neural network, a random forest model will be developed. Try using different criteria to measure split quality, the number of samples for splitting an internal node, and the number of PCA components to use as the hyperparameters to be be tuned.

Note that choosing between 4 and 10 PCA components has been implemented for both models in order to reduce dimensionality of the dataset (as compared to using all 20 features).

In [102]:
from sklearn.ensemble import RandomForestClassifier

bestValAccuracy = 0
bestCrit = ""
bestMinSamplesSplit = -1
bestNumFeatures = -1

for crit in ["gini", "entropy"]:
    for mss in range(10, 100, 10):
        for n_components in range(4, 11, 2):
            RFmodel = RandomForestClassifier(n_estimators=100, criterion=crit, min_samples_split = mss, random_state=21)
            RFmodel.fit(X_train_pca[:, :n_components], y_train)

            y_pred_train = RFmodel.predict(X_train_pca[:, :n_components])
            y_pred_val = RFmodel.predict(X_val_pca[:, :n_components])

            train_acc = metrics.accuracy_score(y_train, y_pred_train)
            val_acc = metrics.accuracy_score(y_val, y_pred_val)

            if val_acc > bestValAccuracy:
                bestValAccuracy = val_acc
                bestCrit = crit
                bestMinSamplesSplit = mss
                bestNumFeatures = n_components

            print("Criterion:", crit, "| Number of components:", n_components, "| min_samples_split =", mss, "| Training Accuracy:", train_acc, "| Validation Accuracy:", val_acc)

print("\nThe best model on the validation set used ", bestCrit, "criterion, min_samples_split =", bestMinSamplesSplit, "and", bestNumFeatures, "PCA components. The validation set accuracy is", bestValAccuracy)
Criterion: gini | Number of components: 4 | min_samples_split = 10 | Training Accuracy: 0.8013457225248318 | Validation Accuracy: 0.4700996677740864
Criterion: gini | Number of components: 6 | min_samples_split = 10 | Training Accuracy: 0.8449214995193848 | Validation Accuracy: 0.4998338870431894
Criterion: gini | Number of components: 8 | min_samples_split = 10 | Training Accuracy: 0.870767916266154 | Validation Accuracy: 0.5176079734219269
Criterion: gini | Number of components: 10 | min_samples_split = 10 | Training Accuracy: 0.8971483498878564 | Validation Accuracy: 0.5337209302325582
Criterion: gini | Number of components: 4 | min_samples_split = 20 | Training Accuracy: 0.6529958346683755 | Validation Accuracy: 0.47475083056478407
Criterion: gini | Number of components: 6 | min_samples_split = 20 | Training Accuracy: 0.693616718288298 | Validation Accuracy: 0.5071428571428571
Criterion: gini | Number of components: 8 | min_samples_split = 20 | Training Accuracy: 0.7176830787852896 | Validation Accuracy: 0.509468438538206
Criterion: gini | Number of components: 10 | min_samples_split = 20 | Training Accuracy: 0.7389013492826373 | Validation Accuracy: 0.524750830564784
Criterion: gini | Number of components: 4 | min_samples_split = 30 | Training Accuracy: 0.6008757876748905 | Validation Accuracy: 0.47840531561461797
Criterion: gini | Number of components: 6 | min_samples_split = 30 | Training Accuracy: 0.6366549182954181 | Validation Accuracy: 0.5003322259136213
Criterion: gini | Number of components: 8 | min_samples_split = 30 | Training Accuracy: 0.655808323543024 | Validation Accuracy: 0.5109634551495017
Criterion: gini | Number of components: 10 | min_samples_split = 30 | Training Accuracy: 0.6738224927907722 | Validation Accuracy: 0.5209302325581395
Criterion: gini | Number of components: 4 | min_samples_split = 40 | Training Accuracy: 0.572679696678415 | Validation Accuracy: 0.47790697674418603
Criterion: gini | Number of components: 6 | min_samples_split = 40 | Training Accuracy: 0.6022998326747125 | Validation Accuracy: 0.49900332225913624
Criterion: gini | Number of components: 8 | min_samples_split = 40 | Training Accuracy: 0.6207056142974118 | Validation Accuracy: 0.5051495016611296
Criterion: gini | Number of components: 10 | min_samples_split = 40 | Training Accuracy: 0.6372245362953469 | Validation Accuracy: 0.5166112956810631
Criterion: gini | Number of components: 4 | min_samples_split = 50 | Training Accuracy: 0.5560539713054933 | Validation Accuracy: 0.479734219269103
Criterion: gini | Number of components: 6 | min_samples_split = 50 | Training Accuracy: 0.5828260173021468 | Validation Accuracy: 0.5024916943521595
Criterion: gini | Number of components: 8 | min_samples_split = 50 | Training Accuracy: 0.5997721528000285 | Validation Accuracy: 0.509468438538206
Criterion: gini | Number of components: 10 | min_samples_split = 50 | Training Accuracy: 0.6136565915482929 | Validation Accuracy: 0.5162790697674419
Criterion: gini | Number of components: 4 | min_samples_split = 60 | Training Accuracy: 0.5445904090569262 | Validation Accuracy: 0.476578073089701
Criterion: gini | Number of components: 6 | min_samples_split = 60 | Training Accuracy: 0.5703656235537043 | Validation Accuracy: 0.4995016611295681
Criterion: gini | Number of components: 8 | min_samples_split = 60 | Training Accuracy: 0.5832176296770978 | Validation Accuracy: 0.5066445182724253
Criterion: gini | Number of components: 10 | min_samples_split = 60 | Training Accuracy: 0.5981345010502331 | Validation Accuracy: 0.5167774086378738
Criterion: gini | Number of components: 4 | min_samples_split = 70 | Training Accuracy: 0.5342304816832212 | Validation Accuracy: 0.4777408637873754
Criterion: gini | Number of components: 6 | min_samples_split = 70 | Training Accuracy: 0.5581544376802307 | Validation Accuracy: 0.5001661129568107
Criterion: gini | Number of components: 8 | min_samples_split = 70 | Training Accuracy: 0.5716828651785396 | Validation Accuracy: 0.5053156146179402
Criterion: gini | Number of components: 10 | min_samples_split = 70 | Training Accuracy: 0.584463669051942 | Validation Accuracy: 0.5119601328903655
Criterion: gini | Number of components: 4 | min_samples_split = 80 | Training Accuracy: 0.52767987468404 | Validation Accuracy: 0.4744186046511628
Criterion: gini | Number of components: 6 | min_samples_split = 80 | Training Accuracy: 0.5489693474313788 | Validation Accuracy: 0.5019933554817275
Criterion: gini | Number of components: 8 | min_samples_split = 80 | Training Accuracy: 0.5628893873046388 | Validation Accuracy: 0.5046511627906977
Criterion: gini | Number of components: 10 | min_samples_split = 80 | Training Accuracy: 0.5720388764284952 | Validation Accuracy: 0.5119601328903655
Criterion: gini | Number of components: 4 | min_samples_split = 90 | Training Accuracy: 0.5208800598098899 | Validation Accuracy: 0.4777408637873754
Criterion: gini | Number of components: 6 | min_samples_split = 90 | Training Accuracy: 0.542703549432162 | Validation Accuracy: 0.49883720930232556
Criterion: gini | Number of components: 8 | min_samples_split = 90 | Training Accuracy: 0.55551995443056 | Validation Accuracy: 0.503156146179402
Criterion: gini | Number of components: 10 | min_samples_split = 90 | Training Accuracy: 0.5655238705543095 | Validation Accuracy: 0.5129568106312292
Criterion: entropy | Number of components: 4 | min_samples_split = 10 | Training Accuracy: 0.7616504681547936 | Validation Accuracy: 0.4729235880398671
Criterion: entropy | Number of components: 6 | min_samples_split = 10 | Training Accuracy: 0.8140553241482431 | Validation Accuracy: 0.501328903654485
Criterion: entropy | Number of components: 8 | min_samples_split = 10 | Training Accuracy: 0.8397237352700345 | Validation Accuracy: 0.5074750830564784
Criterion: entropy | Number of components: 10 | min_samples_split = 10 | Training Accuracy: 0.8715155398910606 | Validation Accuracy: 0.5267441860465116
Criterion: entropy | Number of components: 4 | min_samples_split = 20 | Training Accuracy: 0.6129445690483819 | Validation Accuracy: 0.47524916943521595
Criterion: entropy | Number of components: 6 | min_samples_split = 20 | Training Accuracy: 0.6498629356687672 | Validation Accuracy: 0.4998338870431894
Criterion: entropy | Number of components: 8 | min_samples_split = 20 | Training Accuracy: 0.6749973299156253 | Validation Accuracy: 0.5083056478405316
Criterion: entropy | Number of components: 10 | min_samples_split = 20 | Training Accuracy: 0.6946135497881734 | Validation Accuracy: 0.5176079734219269
Criterion: entropy | Number of components: 4 | min_samples_split = 30 | Training Accuracy: 0.5647050446794118 | Validation Accuracy: 0.4785714285714286
Criterion: entropy | Number of components: 6 | min_samples_split = 30 | Training Accuracy: 0.5952508099255936 | Validation Accuracy: 0.5001661129568107
Criterion: entropy | Number of components: 8 | min_samples_split = 30 | Training Accuracy: 0.6157214567980348 | Validation Accuracy: 0.5034883720930232
Criterion: entropy | Number of components: 10 | min_samples_split = 30 | Training Accuracy: 0.6323827832959521 | Validation Accuracy: 0.5166112956810631
Criterion: entropy | Number of components: 4 | min_samples_split = 40 | Training Accuracy: 0.5404250774324468 | Validation Accuracy: 0.4744186046511628
Criterion: entropy | Number of components: 6 | min_samples_split = 40 | Training Accuracy: 0.569368792053829 | Validation Accuracy: 0.5
Criterion: entropy | Number of components: 8 | min_samples_split = 40 | Training Accuracy: 0.5840364555519955 | Validation Accuracy: 0.5051495016611296
Criterion: entropy | Number of components: 10 | min_samples_split = 40 | Training Accuracy: 0.5981701021752287 | Validation Accuracy: 0.5124584717607974
Criterion: entropy | Number of components: 4 | min_samples_split = 50 | Training Accuracy: 0.5263982341842002 | Validation Accuracy: 0.4790697674418605
Criterion: entropy | Number of components: 6 | min_samples_split = 50 | Training Accuracy: 0.5520310441809961 | Validation Accuracy: 0.4995016611295681
Criterion: entropy | Number of components: 8 | min_samples_split = 50 | Training Accuracy: 0.5666631065541671 | Validation Accuracy: 0.5039867109634552
Criterion: entropy | Number of components: 10 | min_samples_split = 50 | Training Accuracy: 0.5802627363024672 | Validation Accuracy: 0.5111295681063123
Criterion: entropy | Number of components: 4 | min_samples_split = 60 | Training Accuracy: 0.5164299191854462 | Validation Accuracy: 0.4775747508305648
Criterion: entropy | Number of components: 6 | min_samples_split = 60 | Training Accuracy: 0.5419203246822599 | Validation Accuracy: 0.4958471760797342
Criterion: entropy | Number of components: 8 | min_samples_split = 60 | Training Accuracy: 0.5554131510555733 | Validation Accuracy: 0.503156146179402
Criterion: entropy | Number of components: 10 | min_samples_split = 60 | Training Accuracy: 0.5662714941792161 | Validation Accuracy: 0.5109634551495017
Criterion: entropy | Number of components: 4 | min_samples_split = 70 | Training Accuracy: 0.5106625369361671 | Validation Accuracy: 0.47807308970099666
Criterion: entropy | Number of components: 6 | min_samples_split = 70 | Training Accuracy: 0.5323792231834525 | Validation Accuracy: 0.4951827242524917
Criterion: entropy | Number of components: 8 | min_samples_split = 70 | Training Accuracy: 0.5428815550571398 | Validation Accuracy: 0.49767441860465117
Criterion: entropy | Number of components: 10 | min_samples_split = 70 | Training Accuracy: 0.555911566805511 | Validation Accuracy: 0.5046511627906977
Criterion: entropy | Number of components: 4 | min_samples_split = 80 | Training Accuracy: 0.5040407276869949 | Validation Accuracy: 0.4744186046511628
Criterion: entropy | Number of components: 6 | min_samples_split = 80 | Training Accuracy: 0.5261846274342269 | Validation Accuracy: 0.49601328903654485
Criterion: entropy | Number of components: 8 | min_samples_split = 80 | Training Accuracy: 0.5368293638078964 | Validation Accuracy: 0.496843853820598
Criterion: entropy | Number of components: 10 | min_samples_split = 80 | Training Accuracy: 0.5474028979315746 | Validation Accuracy: 0.5049833887043189
Criterion: entropy | Number of components: 4 | min_samples_split = 90 | Training Accuracy: 0.4985225533126847 | Validation Accuracy: 0.476578073089701
Criterion: entropy | Number of components: 6 | min_samples_split = 90 | Training Accuracy: 0.5195628181850547 | Validation Accuracy: 0.49601328903654485
Criterion: entropy | Number of components: 8 | min_samples_split = 90 | Training Accuracy: 0.5305991669336751 | Validation Accuracy: 0.4961794019933555
Criterion: entropy | Number of components: 10 | min_samples_split = 90 | Training Accuracy: 0.5414931111823134 | Validation Accuracy: 0.5028239202657807

The best model on the validation set used  gini criterion, min_samples_split = 10 and 10 PCA components. The validation set accuracy is 0.5337209302325582
In [103]:
# Plot a confusion matrix for the classification counts in the validation set

# Recreate best model and obtain validation set predictions
bestRFModel = RandomForestClassifier(n_estimators=100, criterion=bestCrit, min_samples_split = bestMinSamplesSplit, random_state=21)
bestRFModel.fit(X_train_pca[:, :bestNumFeatures], y_train)
y_pred_val = bestRFModel.predict(X_val_pca[:, :bestNumFeatures])

lbls = classification_genres["title"].values.tolist()

z = confusion_matrix(y_val, y_pred_val, labels=lbls)

fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()

The random forest model with the highest accuracy on the validation set used the gini criterion, min_samples_split=10, and the first 10 PCA components of the MFCC features. Validation set accuracy was 53.4%, and training accuracy was 89.7%. Although the validation set accuracy is 2% higher here than for the KNN, this model is clearly much more overfit to the training data. Therefore, for testing, it would be better to use the KNN model since it is not as likely to be overfit.

The following observations are made based on the confusion matrix:

  • Rock is correctly classified the most, although it should be noted from the data visualization section that the dataset is imbalanced and rock is the most common class.
  • Blues, jazz, country, pop, easy-listening, soul R&B, folk, spoken, hip-hop, instrumental, and all of the international genres are classified incorrectly more often than they are classified correctly. Note that this is the same result as for the KNN model, with the exception that the KNN model was able to classify folk correctly more often than not.
  • Rock, electronic, and experimental are still the most commonly predicted classes in order
  • Flaws with this model are related to the unbalanced training dataset as well as overfitting.

Model Testing¶

Based on the observations made above, test on the KNN model.

In [104]:
y_pred_test = bestKNNModel.predict(X_test_pca[:, :10])
test_acc = metrics.accuracy_score(y_test, y_pred_test)

print("The accuracy of the KNN model on the test set is", test_acc)
The accuracy of the KNN model on the test set is 0.5102990033222591
In [105]:
lbls = classification_genres["title"].values.tolist()

z = confusion_matrix(y_test, y_pred_test, labels=lbls)

fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()

The accuracy on the test set is 51.0%, which is very similar to the validation set accuracy for this model (51.4%). This indicates that the model is not overfit, and although seemingly low, this result is still acceptable given it is 10x more accurate than a classifier choosing at random, and given the large number of classification genres.

The following observations are made based on the confusion matrix:

  • Rock is correctly classified the most, although it should be noted from the data visualization section that the dataset is imbalanced and rock is the most common class.
  • Blues, jazz, country, pop, easy-listening, soul R&B, spoken, hip-hop, instrumental, and all of the international genres are classified incorrectly more often than they are classified correctly. This is the same result as for the validation set for this model.
  • Rock, electronic, and experimental are still the most commonly predicted classes in order

For visualization purposes, the plots below show the first 2 and first 3 PCA components of the MFCC features segmented by actual vs. predicted class. These plots can be filtered as desired in order to gain insight about some of the challenges in segmenting between genres. For example, comparing correctly classified rock with experimental that was predicted to be rock in the graphs below, there is no clear-cut decision boundary apparent.

In [106]:
PCs_df_test = pd.DataFrame(data=X_test_pca[:, :3], columns=["PC1", "PC2", "PC3"])
actual_and_predictions_test = pd.DataFrame(data=np.c_[y_test, y_pred_test], columns=["Actual Class", "Predicted Class"])
PCs_df_test = pd.concat([PCs_df_test.reset_index(drop=True), actual_and_predictions_test.reset_index(drop=True)], axis=1)
In [107]:
fig = px.scatter(PCs_df_test, x="PC1", y="PC2", color="Actual Class", symbol="Predicted Class", title="Top 2 MFCC Principal Components of Test Set Predictions")
fig.show()
In [108]:
fig = px.scatter_3d(PCs_df_test, x="PC1", y="PC2", z="PC3", color="Actual Class", symbol="Predicted Class")

fig.update_layout(margin=dict(l=10, r=10, b=10, t=40), title='Top 3 MFCC Principal Components of Test Set Predictions')
fig.update_scenes(xaxis_autorange="reversed", yaxis_autorange="reversed")
fig.show()